Sentiment Analysis

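Sentiment analysis is a subfield of natural language processing (NLP) that applies machine learning to tasks such as classifying documents according to the attitude of their authors. Also known as opinion mining, it is one of the most popular subfields within the broad area of NLP research.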

IMDb movie review data

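The dataset consists of 50,000 movie reviews, each labeled as positive or negative.

A positive review rated the movie with more than six stars on IMDb, and a negative review rated it with fewer than five stars.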
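The labeling rule can be made concrete with a small sketch. It assumes, based on my reading of the dataset's README, that each review file is named `<id>_<rating>.txt` and that ratings run from 1 to 10, with 7 or higher counted as positive and 4 or lower as negative. The helper names `rating_from_filename` and `label_from_rating` are hypothetical and are not part of the dataset or of any library.

```python
import re

# Assumed filename convention: "<id>_<rating>.txt", e.g. "200_8.txt".
_FILENAME_RE = re.compile(r"^(\d+)_(\d+)\.txt$")


def rating_from_filename(filename):
    """Return the star rating encoded in a review filename."""
    match = _FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return int(match.group(2))


def label_from_rating(rating):
    """Map a star rating to 1 (positive), 0 (negative), or None (neutral)."""
    if rating >= 7:
        return 1
    if rating <= 4:
        return 0
    return None  # 5 or 6 stars: neither positive nor negative


print(label_from_rating(rating_from_filename("200_8.txt")))  # 1
print(label_from_rating(rating_from_filename("13_3.txt")))   # 0
```

In practice the folder names `pos` and `neg` already carry the label, so a helper like this is mainly useful for spot-checking that the folders and the ratings agree.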

https://ai.stanford.edu/~amaas/data/sentiment/

tar -zxf aclImdb_v1.tar.gz

Alternatively, the archive can be extracted with the Python code below.

import tarfile

# 'r:gz' opens a gzip-compressed tar archive for reading.
with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as tar:
    tar.extractall()
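After extraction, it can be worth checking that the directory layout matches what the loading code expects. The following is a minimal sketch, assuming the archive unpacks into an `aclImdb` folder containing `train` and `test` subfolders, each with `pos` and `neg` folders of `.txt` files. The function name `count_reviews` is hypothetical.

```python
import os


def count_reviews(basepath="aclImdb"):
    """Count .txt review files in each (split, label) folder under basepath."""
    counts = {}
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            folder = os.path.join(basepath, split, label)
            counts[(split, label)] = sum(
                1 for name in os.listdir(folder) if name.endswith(".txt")
            )
    return counts


# Only runs when the extracted dataset is present in the working directory.
if os.path.isdir("aclImdb"):
    for key, n in sorted(count_reviews("aclImdb").items()):
        print(key, n)
```

For the full dataset, the four counts should add up to the 50,000 labeled reviews, split evenly at 12,500 per folder.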

Data Preprocessing

pip3 install pyprind

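PyPrind is used to display progress and the estimated time remaining.

The code below creates a progress bar object, pbar, with 50,000 iterations, which equals the number of documents to read.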

import os

import pandas as pd
import pyprind

basepath = 'aclImdb'
labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)

# Collect rows in a list first; DataFrame.append was removed in pandas 2.0.
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()

df = pd.DataFrame(rows, columns=['review', 'sentiment'])

pos label: 1

neg label: 0

Shuffling the DataFrame and saving it to a CSV file

import numpy as np

np.random.seed(0)

df=df.reindex(np.random.permutation(df.index))

df.to_csv('movie_data.csv', index=False, encoding='utf-8')
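Because the reviews are read folder by folder, the assembled DataFrame is sorted by class, which is why it is shuffled before saving. The sketch below uses a toy DataFrame (hypothetical data) to show that shuffling via `reindex` keeps every review paired with its own label, and that `DataFrame.sample(frac=1)` is an alternative way to get a shuffled copy. The two options are not expected to produce the same order.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the review DataFrame (hypothetical data).
df = pd.DataFrame({
    "review": ["good", "great", "fine", "bad", "awful", "poor"],
    "sentiment": [1, 1, 1, 0, 0, 0],
})

# Option 1: permute the index, as in the text, with a local seeded generator.
rng = np.random.RandomState(0)
shuffled = df.reindex(rng.permutation(df.index))

# Option 2: sample all rows without replacement.
sampled = df.sample(frac=1, random_state=0)

# Shuffling moves whole rows, so every (review, label) pair survives intact.
original_pairs = set(zip(df["review"], df["sentiment"]))
shuffled_pairs = set(zip(shuffled["review"], shuffled["sentiment"]))
print(original_pairs == shuffled_pairs)

# The old index travels with the rows; drop it if a clean 0..n-1 index is wanted.
shuffled = shuffled.reset_index(drop=True)
```

When the result is written with `to_csv(..., index=False)`, the permuted index is discarded anyway, so the `reset_index` step only matters if the shuffled DataFrame is used in memory.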

Afterward, the dataset can be reloaded with the code below.

df=pd.read_csv('movie_data.csv', encoding='utf-8')

df.head(3)

                                              review  sentiment
0  at a Saturday matinee in my home town. I went ...          0
1  I love this movie. It is the first film Master...          1
2  In the voice over which begins the film, Hughi...          1

df.shape

(50000, 2)
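Beyond the shape, a quick look at the class balance helps confirm that the CSV round trip kept the labels intact. The following is a minimal sketch on a toy DataFrame (hypothetical data) written to an in-memory buffer instead of `movie_data.csv`. The helper name `class_balance` is hypothetical.

```python
import io

import pandas as pd


def class_balance(df):
    """Return a {label: count} dict for the 'sentiment' column."""
    return df["sentiment"].value_counts().to_dict()


# Toy data; the comma in the first review checks that CSV quoting works.
toy = pd.DataFrame({
    "review": ["good, really", "bad"],
    "sentiment": [1, 0],
})

# Write to an in-memory buffer and read it back, as with movie_data.csv.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape)
print(class_balance(restored))
```

Applied to the full dataset, `class_balance(df)` should report 25,000 reviews for each of the two labels.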
